Predicting Appropriate Semantic Web Terms from Words
Abstract
The Semantic Web language RDF was designed to unambiguously define and use ontologies to encode data and knowledge on the Web. Many people find it difficult, however, to write complex RDF statements and queries because doing so requires familiarity with the appropriate ontologies and the terms they define. We describe a system that suggests appropriate RDF terms given semantically related English words and general domain and context information. We use the Swoogle Semantic Web search engine to provide RDF term and namespace statistics, the WordNet lexical ontology to find semantically related words, and a naïve Bayes classifier to suggest terms. A customized graph data structure of related namespaces is constructed from Swoogle's database to speed up the classifier's model learning and prediction.

Motivation and Objectives

The Semantic Web is realized as a huge graph of data and knowledge. The graph's building blocks consist of literal values and RDF terms representing classes, properties and individuals. Syntactically, an RDF term is expressed as a URI like http://xmlns.com/foaf/0.1/Person, which has two parts: a namespace (http://xmlns.com/foaf/0.1/) identifying the ontology that defines the term and a local name (Person) selecting a term in that ontology. The use of namespaces avoids ambiguity and allows two terms in different ontologies to share a local name. For example, the class Party may be defined in an ontology about politics as well as in another describing daily life. Qualifying Party with a namespace removes the ambiguity.

Authoring or querying knowledge on the Semantic Web is difficult because it requires people to select one ontology from many for the concepts they want to use. The term Party, for example, is defined in 352 Semantic Web ontologies known to Swoogle. Moreover, other ontologies define possibly related concepts, with local names such as Celebration and Organization.
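The namespace and local-name split described above can be illustrated with a minimal sketch. It assumes the common convention that the local name begins after the last `#` if the URI has one, and otherwise after the last `/`. The function name `split_rdf_term` and the two `example.org` namespaces are hypothetical and do not come from the paper.

```python
from typing import Tuple


def split_rdf_term(uri: str) -> Tuple[str, str]:
    """Split an RDF term URI into (namespace, local_name).

    Assumption: the local name starts after the last '#' if one is
    present, otherwise after the last '/'.
    """
    cut = uri.rfind("#")
    if cut == -1:
        cut = uri.rfind("/")
    if cut == -1:
        raise ValueError(f"cannot split {uri!r} into namespace and local name")
    return uri[: cut + 1], uri[cut + 1:]


namespace, local_name = split_rdf_term("http://xmlns.com/foaf/0.1/Person")
print(namespace)   # http://xmlns.com/foaf/0.1/
print(local_name)  # Person

# Two hypothetical ontologies can share the local name "Party"
# because the namespace keeps the full terms distinct.
politics = split_rdf_term("http://example.org/politics#Party")
daily_life = split_rdf_term("http://example.org/daily-life#Party")
print(politics[1] == daily_life[1])  # True
print(politics[0] == daily_life[0])  # False
```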
It would be convenient if a user could use natural language words as her vocabulary and a knowledgeable system would suggest appropriate Semantic Web terms based both on observed data about how people have used different namespaces together and on the user's previously used namespaces or domain information. Such a system could support metadata systems like Flickr's machine tags (Schmitz 2006), which also require qualified namespaces whose selection is hard for end users who know little or nothing about ontologies.

Copyright © 2008, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Problem Analysis

Given an RDF local name, how do we know which namespaces are suitable to qualify it? A simple solution is to ask a Semantic Web search engine, like Swoogle (Ding et al. 2004), to return the most highly ranked public namespaces that define the term. However, this approach relies on an exact string match. In many cases, we might want terms whose local names are semantically related. For example, if a user gives the verb associate and no popular namespace defines a matching term, but the word relate is defined in a very popular namespace, then we can generally substitute relate for associate. Thus, for a given input word we find a set of synonyms or semantically related words from WordNet (Miller 1998). For each one, we use Swoogle to find ontologies that define a term with a corresponding local name. The selection process applies a threshold based on Swoogle's ontoRank (Ding 2005a) metric to limit the choices to popular ontologies. The result is a set of candidate pairs, each specifying an RDF namespace representing the ontology and a local name representing a term defined in that ontology.

Given a set of candidate pairs (namespace, word), selecting the best one depends on domain information provided by users. In fact, the namespaces themselves convey a great deal of domain information.
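The candidate-generation step described above (expand the input word, look up defining namespaces, filter by a rank threshold) can be sketched as follows. This is only an illustration under stated assumptions: the two dictionaries are hypothetical in-memory stand-ins for WordNet and Swoogle lookups, and the namespaces, rank scores and threshold value are invented. No real WordNet or Swoogle API is called.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for a WordNet lookup: word -> semantically related words.
RELATED_WORDS: Dict[str, List[str]] = {
    "associate": ["relate", "link", "connect"],
}

# Hypothetical stand-in for search-engine results:
# local name -> list of (namespace, rank score). All values are invented.
DEFINING_NAMESPACES: Dict[str, List[Tuple[str, float]]] = {
    "relate": [
        ("http://example.org/popular#", 0.92),
        ("http://example.org/obscure#", 0.03),
    ],
    "link": [("http://example.org/web#", 0.40)],
}


def candidate_pairs(word: str, min_rank: float) -> List[Tuple[str, str]]:
    """Return (namespace, local_name) pairs for the word and its related
    words, keeping only namespaces whose rank score reaches min_rank."""
    words = [word] + RELATED_WORDS.get(word, [])
    pairs = []
    for w in words:
        for namespace, rank in DEFINING_NAMESPACES.get(w, []):
            if rank >= min_rank:
                pairs.append((namespace, w))
    return pairs


pairs = candidate_pairs("associate", min_rank=0.25)
print(pairs)
# [('http://example.org/popular#', 'relate'), ('http://example.org/web#', 'link')]
```

In this toy data no namespace defines associate itself, so the only candidates come from the related words relate and link, mirroring the substitution example in the text.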
If users have previously used some namespaces, these can be used to determine which candidates are most appropriate, based on the correlation among namespaces observed on the Semantic Web. Lacking a history of namespace use, users may directly tell the system their general domain with a set of words and phrases, such as geography or ecoinformatics. The system can then choose the most representative namespace in that domain as an implicit prior namespace. In sum, we can focus on one key problem: finding the namespace with the highest conditional probability given the input namespaces.

Predicting Appropriate Namespaces

Computing exact conditional probabilities is very expensive when the number of nodes is large. Fortunately, we need only the rank ordering of namespaces with respect to their conditional probability, which enables us to find maximum a posteriori (MAP) hypotheses. A naïve Bayes (NB) classifier can be used for this purpose. Although it makes a conditional independence assumption that is not true in most cases, it performs very well in text classification and other problems. Since our training dataset is large in terms of both observations and nodes, NB is a practical approach because of its computational simplicity. Once we obtain an ordering of candidate namespaces with respect to their conditional probability given the input namespaces, we select the first one that matches the namespace of any candidate pair.

We have 2.5 million entries in our dataset, which comes from all RDF documents indexed by Swoogle (Ding et al. 2004), a crawler-based Semantic Web search engine that discovers and indexes documents containing RDF data. Running since 2004, Swoogle has indexed nearly 2.5M such documents, about 10K of which are ontologies that define terms. As new Semantic Web documents are discovered, Swoogle analyzes them to extract their data, compute metadata and derive statistical properties.
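The ranking criterion from the naïve Bayes discussion above can be written out explicitly. The notation is an assumption introduced here for illustration: $c$ is a candidate namespace and $n_1, \dots, n_k$ are the input namespaces. The first two steps are Bayes' rule with the denominator dropped because it does not depend on $c$. The last step applies the conditional independence assumption.

```latex
\begin{align*}
\hat{c}
  &= \arg\max_{c} \; P(c \mid n_1, \dots, n_k) \\
  &= \arg\max_{c} \; \frac{P(c)\, P(n_1, \dots, n_k \mid c)}{P(n_1, \dots, n_k)}
   = \arg\max_{c} \; P(c)\, P(n_1, \dots, n_k \mid c) \\
  &\approx \arg\max_{c} \; P(c) \prod_{i=1}^{k} P(n_i \mid c)
\end{align*}
```

Because only the ordering of candidates matters, the unnormalized product is sufficient, and in practice it is usually computed as a sum of logarithms to avoid numerical underflow.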
Swoogle stores the extracted data in a relational database and an information retrieval system (currently Lucene). In addition, a copy of each source document is stored and added to an archive of all versions of every document discovered. Each entry in our dataset contains several namespaces, which are represented by their Swoogle IDs.

We apply ten-fold cross-validation to the dataset. In each fold, the dataset is split into a training set and a test set. The training set is used to train a model that outputs the ten most probable namespaces when given a group of input namespaces. We test the model by selecting an entry from the test set, randomly removing one of its namespaces, and using the remaining namespaces as input to the model. The target namespace is the one we removed. We evaluate the model by observing whether the target is among the top ten suggestions and how high on the list it appears. Checking whether the target is among the top ten provides a simple test of the soundness of the model in computing MAP hypotheses.

Our dataset has more than 20K distinct namespaces, which would require 20K categories for classification, making NB computationally expensive. Instead of brute-force counting, we use another approach that exploits namespace locality. We observe that the namespace co-occurrence graph is very sparse: a few namespaces such as FOAF (Ding 2005b) are very widely used, but most are used with a small set of domain-related namespaces. Consequently, it makes no sense to iterate over all possible classifications to find the most probable ones.
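The locality idea and the held-out test can be sketched together. This is a minimal illustration, not the paper's implementation. It assumes entries are lists of integer namespace IDs (the IDs below are invented), that the "graph of related namespaces" is a sparse co-occurrence count table, and that the NB likelihoods use add-one smoothing. The function names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import permutations


def build_cooccurrence(entries):
    """Return (count, cooc): entry counts per namespace and a sparse graph
    where cooc[a][b] is the number of entries containing both a and b."""
    count = defaultdict(int)
    cooc = defaultdict(lambda: defaultdict(int))
    for entry in entries:
        ids = set(entry)
        for a in ids:
            count[a] += 1
        for a, b in permutations(ids, 2):
            cooc[a][b] += 1
    return count, cooc


def suggest(inputs, count, cooc, total, k=10):
    """Rank candidate namespaces by an NB log-score, scoring only the
    graph neighbours of the inputs instead of every known namespace."""
    inputs = set(inputs)
    candidates = set()
    for n in inputs:
        candidates.update(cooc.get(n, {}))
    candidates -= inputs
    scored = []
    for c in candidates:
        score = math.log(count[c] / total)  # log prior
        for n in inputs:
            # Smoothed estimate of P(n | c); the smoothing is an assumption.
            score += math.log((cooc[c].get(n, 0) + 1) / (count[c] + 2))
        scored.append((score, c))
    scored.sort(key=lambda t: (-t[0], t[1]))  # best score first, ID breaks ties
    return [c for _, c in scored[:k]]


def held_out_rank(entry, target, count, cooc, total, k=10):
    """Remove target from entry, query with the rest, and return the
    1-based rank of target in the top k, or None if it is absent."""
    inputs = [n for n in entry if n != target]
    top = suggest(inputs, count, cooc, total, k)
    return top.index(target) + 1 if target in top else None


# Invented training entries; each list holds hypothetical namespace IDs.
train = [[1, 2, 3], [1, 2], [1, 2, 3], [1, 4], [5, 6], [5, 6, 1]]
count, cooc = build_cooccurrence(train)
total = len(train)

print(suggest([2], count, cooc, total))  # [1, 3]
print(suggest([5], count, cooc, total))  # [6, 1]
print(held_out_rank([1, 2, 3], 3, count, cooc, total))  # 1
```

In the second query the widely used namespace 1 loses to namespace 6, which co-occurs with 5 more consistently. In a full ten-fold run the removed target would be chosen at random (for example with `random.choice`), and the counts would be built from the training fold only.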
Similar resources
Finding Appropriate Semantic Web Ontology Terms from Words
The Semantic Web was designed to unambiguously define and use ontologies to encode data and knowledge on the Web. Many people find it difficult, however, to write complex RDF statements and queries because doing so requires familiarity with the appropriate ontologies and the terms they define. We describe a system that automatically maps a set of ordinary English words to a set of appropriate o...
Finding Semantic Web Ontology Terms from Words
The Semantic Web was designed to unambiguously define and use ontologies to encode data and knowledge on the Web. Many people find it difficult, however, to write complex RDF statements and queries because it requires familiarity with the appropriate ontologies and the terms they define. We describe a framework that eases the experiences in authoring and querying RDF data, in which we focus on ...
IMPROVE THE RECOMMENDER SYSTEM USING SEMANTIC WEB
To buy his/her necessities such as books, movies, CD, music, etc., one always trusts others’ oral and written consultations and offers and include them in his/her decisions. Nowadays, regarding the progress of technologies and development of e-business in websites, a new age of digital life has been commenced with the Recommender systems. The most important objectives of these systems include a...
Query Architecture Expansion in Web Using Fuzzy Multi Domain Ontology
Due to the increasing web, there are many challenges to establish a general framework for data mining and retrieving structured data from the Web. Creating an ontology is a step towards solving this problem. The ontology raises the main entity and the concept of any data in data mining. In this paper, we tried to propose a method for applying the "meaning" of the search system, But the problem ...
Semantic processing survey of spoken and written words in adolescents with cerebral palsy: Evidence from PALPA word-picture matching test
Objective: The present study aimed to assess and compare semantic processing of spoken and written words in adolescents with cerebral palsy and healthy adolescents. Method: The present study is quantitative in terms of type and experimental in terms of method. Examination Group consisted 30 adolescents with cerebral palsy aged 10 to 15 years were selected by convenience sampling method. All of ...
Publication date: 2008